[57] ABSTRACT

A computerized method of ordering document clusters for
presentation after browsing a corpus of documents that
presents document clusters in a logical fashion in the
absence of any indication of the computer user's interests.
The method begins by grouping the corpus into a plurality
of clusters, each having a centroid and including at least one
document. Next, for each cluster a degree of similarity
between that cluster and every other cluster is determined by
finding a dot product between each cluster centroid and every
other cluster centroid. The similarity information is then used
to determine an order of presentation for the plurality of
clusters in a way that maximizes the degree of similarity
between adjacent clusters.

11 Claims, 6 Drawing Sheets 
METHOD OF ORDERING DOCUMENT CLUSTERS WITHOUT REQUIRING
KNOWLEDGE OF USER INTERESTS

FIELD OF THE INVENTION

The present invention relates to a method of document
clustering. In particular, the present method relates to a
method of logically ordering document clusters for
presentation to a computer user in the absence of indications
of user interests.

BACKGROUND OF THE INVENTION

Until recently the conventional wisdom held that document
clustering was not a useful information retrieval tool.
Objections to document clustering included its slowness
with large document corpora and its failure to appreciably
improve retrieval. However, when used as an access tool in
its own right, document clustering, often repeated stage after
stage, can be a powerful technique for browsing and searching
a large document corpus. Pedersen et al. describe such
a document browsing technique in U.S. Pat. No. 5,442,778,
entitled "Scatter-Gather: A Cluster-Based Method and
Apparatus for Browsing Large Document Collections."

Using document clustering stage after stage as its
centerpiece, the Scatter-Gather method disclosed by Pedersen
et al. enables information access for those with non-specific
goals, who may not be familiar with the appropriate
vocabulary for describing the topic of interest, or who are
not looking for anything they are able to specify, as well as
for those with specific interests. Scatter-Gather does so by
repeatedly scattering the documents of the current collection
and then gathering them into clusters and presenting
summaries of the clusters to the user. Given this initial
grouping the user may select one or more clusters, whose
documents [illegible] facilitate a well-specified search or
browsing. The documents of this modified sub-corpus are again
scattered and then gathered into new clusters. With each
iteration, the number of documents in each cluster becomes
smaller and their mutual similarity more detailed.

FIG. 1 illustrates an exemplary presentation and ordering
of cluster summaries on a computer screen, which were
generated for an initial scattering of a corpus consisting of
the August 1990 articles provided by the New York Times
News Service. The first line of each cluster summary
includes the cluster number, the number of documents in the
cluster, and a number of partial typical titles of articles
within the cluster. The second line of each cluster summary
lists words frequent within the cluster. While useful, these
cluster summaries are not as helpful as the table of contents
of a conventional textbook because their order of presentation
does not indicate any relationship or similarity between
adjacent clusters.

As FIG. 1 illustrates, clusters need not be presented to the
user for consideration one at a time. However, there are
limitations to how many clusters can be presented at a single
time on a computer screen. The limitations of display device
dimensions and the user's short term memory determine an
upper limit on how many clusters can be usefully presented
at once. If the number of clusters at a particular stage of a
particular search exceeds this upper limit it is possible and
often desirable to group those clusters into fewer
super-clusters, replacing what would have been one search
stage by two search stages.

SUMMARY OF THE INVENTION

A computerized method of ordering document clusters for
presentation after browsing a corpus of documents will be
described that orders and presents document clusters in a
logical fashion in the absence of any indication of the
computer user's interests. The method begins by grouping
the corpus into a plurality of clusters, each having a centroid
and including at least one document. Next, for each cluster
a degree of similarity between that cluster and every other
cluster is determined by finding a dot product between each
cluster centroid and every other cluster centroid. The
similarity information is then used to determine an order of
presentation for the plurality of clusters in a way that
maximizes the degree of similarity between adjacent clusters.

Other objects, features, and advantages of the present
invention will be apparent from the accompanying drawings
and detailed description that follows.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example
and not by way of limitation in the figures of the
accompanying drawings. In the accompanying drawings similar
references indicate similar elements.

FIG. 1 illustrates prior, disorderly cluster summaries for
presentation.

FIG. 2 illustrates a computer system for ordering document
clusters for presentation.

FIG. 3 illustrates a method of document cluster
summarization.

FIG. 4 illustrates a method of ordering document clusters
for presentation in the absence of any user input.

FIG. 5 illustrates a method of ordering document clusters
for presentation when document rankings are provided.

FIG. 6 illustrates a method of ordering document clusters
for presentation when document satisfaction indicators are
provided.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 2 illustrates computer system 20, which incorporates
the method of the present invention for ordering document
clusters. Briefly described, the method of the present
invention enables computer system 20 to automatically order
document clusters for presentation to a computer user in a
logical and useful fashion in the absence of any indication
from the computer user of the user's interests. Computer
system 20 does so by finding the degree of similarity between
each cluster and every other cluster. In one embodiment,
computer system 20 orders the clusters so that the degree of
similarity between adjacent clusters is maximized. In a second
embodiment, computer system 20 orders the clusters so that
the degree of similarity between distant clusters in the
ordering is kept large. When the number of clusters is large,
this embodiment greatly reduces the [remainder of sentence
illegible].

A. The Document Clustering Computer System

Prior to a more detailed discussion of the present
invention, consider computer system 20. Computer system
20 includes monitor 22 for visually displaying information
to a computer user. Computer system 20 also outputs
information to the computer user via printer 24. Computer
system 20 provides the computer user multiple avenues to
input data. Keyboard 26 and mouse 28 allow the computer user
to input data manually. The computer user may also input
information by writing on electronic tablet 30 with pen 32.
Alternately, the computer user can input data stored on
machine readable media 32, such as a floppy disk, by
inserting machine readable media into disk drive 34. Optical
character recognition unit (OCR unit) 36 permits users to
input hard copy natural language documents, like document
38, which it converts into a coded electronic representation,
typically American National Standard Code for Information
Interchange (ASCII).

Processor 40 controls and coordinates the operation of
computer system 20 to execute the commands of the computer
user. Processor 40 determines and takes the appropriate
action in response to each command by executing
instructions stored electronically in memory, either memory
42 or on floppy disk 32 within disk drive 34. Typically,
operating instructions for processor 40 are stored in solid
state memory 42, allowing frequent and rapid access to the
instructions. Devices that can be used to implement memory
42 include standard, commercially available semiconductor
logic devices such as read only memories (ROM), random
access memories (RAM), dynamic random access memories
(DRAM), programmable read only memories (PROM),
erasable programmable read only memories (EPROM), and
electrically erasable programmable read only memories
(EEPROM), such as flash memories.

B. Document Clustering 

The documents of a corpus must be clustered before the 
order of presentation of the clusters can be determined. The 
clusters at each stage of a search may have been 
precomputed, allowing their use in many other computer
searches, or the clustering at each stage may be performed 
"on the fly." The most reasonable approach in many situa- 
tions is to precompute the clusters for early stages and to 
compute the clusters in the later stages "on the fly." 

This clustering may be done using a variety of techniques,
including those described in U.S. Pat. No. 5,442,778 to
Pedersen et al., which is incorporated herein by reference.
Typically, clustering algorithms represent each document d
of a corpus C using an appropriate lexicon, V. The appropriate
lexicon will often utilize gentle stemming, i.e., combinations
of words that differ by simple suffixes become a single term,
and usually excludes words found on an extended list of stop
words. As used herein, stop words are words that do little to
change the topics of the sentences in which they appear. A
suitable lexicon may also include selected word pairs and
might differ from stage to stage of a search.

Some clustering algorithms use a countfile, c(d), to represent
each document. In a countfile each scalar represents the
number of times each term of the appropriate lexicon, V,
occurs in document d.

A countfile can be expressed as:

$$c(d) = \{ f(\omega_i, d) \} \quad \text{for } i = 1 \text{ to } |V|$$

where:

$\omega_i$ is the ith term in lexicon V; and
$f(\omega_i, d)$ represents the frequency of the term $\omega_i$ in
document d.

While the countfile includes desirable information beyond
the presence or absence of the terms of the lexicon being
used, another function, $\phi(d)$, often conveys more useful
information. The function $\phi$ at first increases rapidly with
its argument and then more slowly, and defines a profile,
p(d). This profile is expressed as:

$$p(d) = \phi(c(d))$$
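The countfile and profile can be illustrated with a short sketch. The text does not name the damping function, so a square root is assumed here purely because it rises quickly for small counts and slowly for large ones; the lexicon, the tokens, and the function names `countfile` and `profile` are hypothetical.

```python
import math
from collections import Counter


def countfile(tokens, lexicon):
    """Return c(d): one term frequency per lexicon entry, in lexicon order."""
    counts = Counter(tokens)
    return [counts[term] for term in lexicon]


def profile(counts):
    """Return p(d) by damping each count (square root is an assumption)."""
    return [math.sqrt(count) for count in counts]


lexicon = ["cluster", "document", "rank"]  # hypothetical lexicon V
tokens = "document cluster document rank document document".split()

c = countfile(tokens, lexicon)
p = profile(c)
print(c)  # [1, 4, 1]
print(p)  # [1.0, 2.0, 1.0]
```

Any other function with the same shape, such as a logarithm of one plus the count, could be substituted for the square root without changing the rest of the sketch.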

FIG. 3 illustrates the major tasks performed prior to
presenting cluster summaries to a computer user. First,
during step 52 a corpus of documents is grouped into a set
of k initial clusters. That is to say, the documents of the
corpus are organized into groups. That done, attention
turns to generating a summary for each cluster. Each cluster
summary preferably includes a list of typical, or
representative, partial document titles and a list of frequent,
or representative, terms. During step 54 processor 40 selects
typical document titles for each document cluster. These
titles may be selected in a number of ways. For example,
titles can be selected based upon the proximity of their
document's profile p(d) to the cluster centroid, p. As used
herein, a cluster centroid p is a vector in which each scalar
represents the average value of p(d) within the cluster of
each term $\omega$ of the appropriate lexicon V. Afterward, during
step 56 typical terms are chosen to represent each cluster.
Again, this can be done in a number of ways. One simple
way is to select a number of the most frequently used terms
within the documents of each cluster, either by count, by
proportion of the total number of occurrences of each term
in the corpus, or by appearing in the cluster, or by a
combination of these criteria. This information is easily
derived given countfiles, or profiles, and the terms of the
appropriate lexicon.
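The summarization steps can be sketched as follows. The centroid is the term-by-term average of the cluster's profiles, as defined above. Measuring "proximity" by a dot product with the centroid, and ranking terms by their centroid weight, are assumptions made for this illustration; the profiles, titles, and function names are hypothetical.

```python
def centroid(profiles):
    """Average the profiles of one cluster term by term."""
    size = len(profiles)
    return [sum(column) / size for column in zip(*profiles)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def typical_titles(titles, profiles, count):
    """Titles whose profiles lie closest to the centroid (largest dot product)."""
    center = centroid(profiles)
    ranked = sorted(zip(titles, profiles),
                    key=lambda pair: dot(pair[1], center), reverse=True)
    return [title for title, _ in ranked[:count]]


def typical_terms(lexicon, profiles, count):
    """Lexicon terms carrying the largest weight in the centroid."""
    center = centroid(profiles)
    ranked = sorted(zip(lexicon, center), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:count]]


lexicon = ["budget", "senate", "game"]
titles = ["Senate passes budget", "Budget talks stall", "Playoff game recap"]
profiles = [[1.0, 1.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(typical_titles(titles, profiles, 2))
print(typical_terms(lexicon, profiles, 1))
```

With unnormalized profiles a dot product favors long documents, so an implementation might normalize each profile to unit length first.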

C. Cluster Ordering in the Absence of Indications of
Topics of Interest from the Computer User

FIG. 4 illustrates in flow diagram form the instructions 60
executed by processor 40 to determine a logical order to
present cluster summaries in the absence of any indication
from a computer user about the user's interests. Instructions
60 may be stored in solid state memory 42 or on a floppy
disk placed in disk drive 34. Instructions may be realized in
any computer language, including LISP and C++.

Briefly described, instructions 60 determine the order of
cluster presentation by determining the degree of similarity
between each cluster and every other cluster. In one
embodiment this information is used to numerically order the
clusters so that when presented in numerical order the
degree of similarity between adjacent clusters is
substantially maximized. In an alternate embodiment, the
ordering of clusters is chosen to force pairs of clusters
distant from [remainder of sentence illegible].

Execution of instructions 60 is initiated in the absence of
any indication from the computer user about documents of
particular interest upon receipt of cluster summaries and
cluster centroids. In response to initiation, processor 40
branches to step 62. During step 62 processor 40 determines
the degree of similarity between each cluster and every other
cluster. According to instructions 60, cluster similarity is
measured by the dot product between cluster centroids. In
other words, the similarity between clusters $p_i$ and $p_j$ can be
expressed as $(p_i \cdot p_j)$. Afterward, processor 40 branches
from step 62 to step 64.
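Step 62 amounts to building a symmetric matrix of pairwise dot products. The sketch below assumes each centroid is a plain list of floats of equal length; the example centroids and the name `similarity_matrix` are hypothetical.

```python
def similarity_matrix(centroids):
    """Return sim where sim[i][j] is the dot product of centroids i and j."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in centroids]
            for ci in centroids]


centroids = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
sim = similarity_matrix(centroids)
for row in sim:
    print(row)
```

If the centroids are scaled to unit length, every entry falls between 0 and 1 for non-negative profiles, which keeps a distance of the form one minus similarity non-negative.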

During step 64 processor 40 takes the similarity
measurements of step 62 and assigns numbers to the clusters
such that when presented in numerical order the degree of
similarity between adjacent clusters is maximized. This can be
done by treating the problem as an instance of the traveling
salesman problem in which the distance to be traveled
between adjacent clusters $C_i$ and $C_j$ is $(1 - (p_i \cdot p_j))$.
Approaches to solving the traveling salesman problem are
well known and will not be discussed in detail herein. One
recent work treating the traveling salesman problem is The
Traveling Salesman Problem: A Guided Tour of Combinatorial
Optimization, edited by Lawler et al. Chapter 4 of that
work is incorporated herein by reference. Given that the
number of clusters k is small, an exact solution, or an
accurate approximation of the solution, to the traveling
salesman problem is not computationally expensive using
standard heuristics. After solving the problem, a linear
ordering of the clusters can be generated by breaking the
cycle at the point of greatest dissimilarity between adjacent
clusters.
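For a small number of clusters the tour can be found by exhaustive search, which is what the following sketch does; it is not one of the heuristics from the cited book. The similarity matrix is hypothetical, and its values were chosen to be exactly representable so the arithmetic is easy to follow. The function name `order_by_tour` is likewise an assumption.

```python
from itertools import permutations


def order_by_tour(sim):
    """Order clusters along the shortest cycle, cut at its weakest link."""
    k = len(sim)
    if k < 3:
        return list(range(k))

    def dist(i, j):
        return 1.0 - sim[i][j]

    best_tour, best_length = None, float("inf")
    # Fix cluster 0 as the start so rotations of one cycle are not retested.
    for rest in permutations(range(1, k)):
        tour = (0,) + rest
        length = sum(dist(tour[n], tour[(n + 1) % k]) for n in range(k))
        if length < best_length:
            best_tour, best_length = tour, length

    # Break the cycle between its least similar pair of neighbors.
    cut = max(range(k), key=lambda n: dist(best_tour[n], best_tour[(n + 1) % k]))
    return [best_tour[(cut + 1 + n) % k] for n in range(k)]


sim = [
    [1.0,   0.125, 0.875, 0.125],
    [0.125, 1.0,   0.75,  0.5],
    [0.875, 0.75,  1.0,   0.125],
    [0.125, 0.5,   0.125, 1.0],
]
print(order_by_tour(sim))  # [0, 2, 1, 3]
```

Exhaustive search examines (k-1)! tours, so it is only practical for roughly ten clusters or fewer; beyond that a heuristic would be needed.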

Other algorithms also can be used to determine cluster
ordering, including "greedy closeness" algorithms that add
one link at a time, linking the clusters of the most similar
eligible pair of clusters. "Greedy distance" algorithms can
also be used to order the clusters. These algorithms build up
ordered subsets of the clusters by iteratively finding the
cluster whose centroid is furthest from the clusters already
placed in the subset and placing that cluster where its
centroid is closest to the centroids already in the list.
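One way to read the "greedy closeness" description is sketched below. The text does not define "eligible", so the sketch assumes a pair is eligible when neither cluster already has two neighbors and the two clusters are not yet in the same chain, which guarantees the links form a single path. The similarity matrix and the function name are hypothetical.

```python
def greedy_closeness_order(sim):
    """Chain clusters by repeatedly linking the most similar eligible pair."""
    k = len(sim)
    if k == 0:
        return []
    pairs = sorted(((sim[i][j], i, j)
                    for i in range(k) for j in range(i + 1, k)), reverse=True)
    neighbors = {i: [] for i in range(k)}
    chain = list(range(k))  # chain[i] labels the chain that holds cluster i
    for _, i, j in pairs:
        # Eligible (assumed): both ends still open and no cycle would form.
        if len(neighbors[i]) < 2 and len(neighbors[j]) < 2 and chain[i] != chain[j]:
            neighbors[i].append(j)
            neighbors[j].append(i)
            old, new = chain[j], chain[i]
            chain = [new if label == old else label for label in chain]

    # Walk the finished path starting from the lower-numbered endpoint.
    start = min(i for i in range(k) if len(neighbors[i]) < 2)
    order, previous = [start], None
    while len(order) < k:
        following = [n for n in neighbors[order[-1]] if n != previous]
        previous = order[-1]
        order.append(following[0])
    return order


sim = [
    [1.0,   0.125, 0.875, 0.125],
    [0.125, 1.0,   0.75,  0.5],
    [0.875, 0.75,  1.0,   0.125],
    [0.125, 0.5,   0.125, 1.0],
]
print(greedy_closeness_order(sim))  # [0, 2, 1, 3]
```

The sort dominates the cost at O(k² log k), which is far cheaper than an exact tour, at the price of an ordering that may not be optimal.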

Having determined the ordering, processor 40 branches
from step 64 to step 66. Processor 40 then presents the
cluster summaries in the order determined previously. The
document summaries may be presented to the user via
monitor 22, printer 24 and/or stored to solid state memory 42
for later display. Cluster presentation complete, processor 40
branches from step 66 to step 68, returning control to the
routine that called instructions 60.

D. Cluster Ordering Given External Information 

1. Cluster Ordering Based on Document Rank
FIG. 5 illustrates in flow diagram form the instructions 80
executed by processor 40 to determine a logical order to
present cluster summaries when the documents of the corpus
have been ranked, often with ties, on the basis of previous
search histories, either those of the user or by a member of
a group whose interests are believed to be similar to that of
the present user. This ranking may reflect accumulated data
on documents finally selected in earlier searches.
Alternatively, the ranking may reflect the scoring inherent in
a similarity search. (See G. Salton and M. J. McGill,
Introduction to Modern Information Retrieval, McGraw-Hill,
1983, for a discussion of similarity searches.)

Instructions 80 determine an order of presentation given
a prior user query, which is used as an indication of the
computer user's interests. Briefly described, instructions 80
determine the order of cluster presentation using document
rank to determine a cluster rank. Ties between ranked
clusters may be broken by treating the tied clusters in the
manner discussed previously with respect to FIG. 4 and
instructions 60. Cluster summaries are then presented to the
computer user in the resulting rank order. Instructions 80
may be stored in solid state memory 42 or on a floppy disk
placed in disk drive 34. Instructions may be realized in any
computer language, including LISP and C++.

Execution of instructions 80 is initiated upon receipt of a
ranked, tie-broken, and clustered document corpus. Processor
40 responds to initiation by advancing to step 82. During
step 82 processor 40 determines a rank for each cluster based
upon the rank of a document d within that cluster. In one
embodiment, the rank r of cluster $C_i$ is equal to the rank of
the cluster's most desirable document, r(d). That is to say, if
low rankings are defined as desirable, then the rank of the
cluster will be set to that of the cluster's lowest ranking
document. Stated mathematically:

$$r(C_i) = \min r(d)$$

where

$d \in C_i$.

Alternatively, other methods can be used during step 82 to
determine cluster rank. For example, cluster rank can be set
equal to the median document rank of a cluster, equal to the
average document rank, or equal to the total rank of a subset
of the lowest ranking documents in the cluster, e.g. the ten
lowest ranking documents or the eighth or ninth lowest
ranking documents.

Alternatively, other information can be used to rank
clusters directly. Such information includes knowledge of
[illegible] being processed, or, more often, knowledge of
choices by groups of similarly interested users. Again, ties
between ranked clusters can be broken using the method
described previously with respect to FIG. 4 and instructions 60.
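The cluster-rank alternatives for step 82 can be sketched as below, assuming low ranks are the desirable ones as in the text. The cluster names, the rank lists, and both function names are hypothetical, and ties are left in input order here rather than broken with the similarity-based method of FIG. 4.

```python
from statistics import mean, median


def cluster_rank(doc_ranks, method="best"):
    """Derive one rank for a cluster from the ranks of its documents."""
    if method == "best":
        return min(doc_ranks)  # rank of the most desirable document
    if method == "median":
        return median(doc_ranks)
    if method == "average":
        return mean(doc_ranks)
    raise ValueError(f"unknown method: {method}")


def order_clusters(ranks_by_cluster, method="best"):
    """Return cluster names sorted from most to least desirable rank."""
    return sorted(ranks_by_cluster,
                  key=lambda name: cluster_rank(ranks_by_cluster[name], method))


ranks_by_cluster = {"A": [7, 2, 9], "B": [1, 30, 40], "C": [3, 4, 5]}
print(order_clusters(ranks_by_cluster, "best"))    # ['B', 'A', 'C']
print(order_clusters(ranks_by_cluster, "median"))  # ['C', 'A', 'B']
```

The example shows why the choice matters: cluster B holds the single best document but is otherwise poorly ranked, so it leads under "best" and trails under "median".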

Having ranked the clusters, processor 40 branches to step
84 from step 82. During step 84 processor 40 uses document
rank to determine the order of presentation for the document
titles, which make up part of each cluster summary, and
breaks ties between documents when necessary as discussed
with respect to instructions 60. Processor 40 orders the
presentation of the clusters so that within the cluster
summaries the titles of documents with lower rankings are
presented before those with higher rankings.

Alternatively, during step 84 the summary of each cluster
can be modified by replacing the partial titles that make up
part of the summary with an equal number of partial titles
from the documents in the cluster that have the lowest ranks.
Ties between documents having the same rank can be
broken in the manner previously discussed. As before, the
partial titles of documents with lower ranks are presented
before those with higher ranks. Processor 40 then branches
from step 84 to step 86.

Having determined the order of presentation of partial
titles within each summary during step 84, processor 40
advances to step 86. During that step processor 40 presents
the cluster summaries in cluster rank order. The document
summaries may be presented to the user via monitor 22,
printer 24 and/or stored to solid state memory 42 for later
display. Cluster presentation complete, processor 40
branches from step 86 to step 88, returning control to the
routine that called instructions 80.

2. Cluster Ordering Based on Binary Document Scores
FIG. 6 illustrates in flow diagram form the instructions
100 executed by processor 40 to determine an order to
present cluster summaries when a boolean constraint,
preferably structured as a combination of partial constraints,
has been applied to the document corpus by the computer user.
That is to say, instructions 100 treat satisfaction of the
boolean constraint as an indication of the computer user's
interest in a particular document, which is then used to order
the document clusters for presentation. Instructions 100 may
be stored in solid state memory 42 or on a floppy disk placed
in disk drive 34. Instructions may be realized in any
computer language, including LISP and C++.

Execution of instructions 100 is initiated upon receipt of
a clustered document corpus and satisfaction indicators,
I(d), which indicate for each document whether that document
satisfies the user's boolean constraint. Processor 40
responds to initiation by advancing to step 102.

During step 102 processor 40 calculates a score for each
cluster based upon the number of documents within the cluster
that satisfy the boolean constraint. If I(d) = 1 when a
document d satisfies the computer user's boolean constraint,
and if I(d) = 0 when document d does not satisfy the boolean
constraint, then the score s for a cluster $C_i$ can be
calculated in a number of ways. In one embodiment the cluster
score is the sum of satisfaction indicators for that cluster.
Stated mathematically:

$$s(C_i) = \sum I(d)$$

where

$d \in C_i$.

In yet another embodiment the cluster score can be
calculated as the sum of satisfaction indicators divided by
the number of documents in the cluster:

$$s(C_i) = \frac{1}{|C_i|} \sum_{d \in C_i} I(d)$$

where $|C_i|$ is the number of documents in $C_i$.
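Both scoring variants reduce to a sum over the satisfaction indicators of one cluster. A minimal sketch, assuming each cluster arrives as a non-empty list of 0/1 indicators; the function name `cluster_score` and the `normalize` flag are hypothetical.

```python
def cluster_score(indicators, normalize=False):
    """Score a cluster from its 0/1 satisfaction indicators I(d)."""
    total = sum(indicators)
    if normalize:
        # Second variant: fraction of the cluster's documents that satisfy.
        return total / len(indicators)
    return total


indicators = [1, 0, 1, 1]
print(cluster_score(indicators))                  # 3
print(cluster_score(indicators, normalize=True))  # 0.75
```

The raw sum favors large clusters, while the normalized form favors clusters that are dense in satisfying documents regardless of size.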

If the total number of documents that satisfy the boolean
constraint in any particular set of clusters is too small, say
less than 25, the ranking process should be modified to
consider those documents that only partially satisfy the
boolean constraint. In such circumstances, if the boolean
constraint requires concatenations of two parts, B and C,
then the score for each cluster can be taken as the lesser of
the number of documents satisfying B and the number of
documents satisfying C, or some other combination of the
two numbers. The generalization to k partials is immediate.
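The fallback to partial constraints can be sketched as follows. The threshold of 25 comes from the text; the data layout, in which each cluster carries one list of 0/1 flags per partial constraint, and the function names are assumptions made for this illustration.

```python
def partial_score(partial_flags):
    """Lesser of the per-partial counts; works for any number of partials."""
    return min(sum(flags) for flags in partial_flags)


def score_clusters(full, partials, min_hits=25):
    """Use full-constraint counts unless too few documents satisfy it."""
    total_hits = sum(sum(flags) for flags in full.values())
    if total_hits >= min_hits:
        return {name: sum(flags) for name, flags in full.items()}
    return {name: partial_score(partials[name]) for name in full}


# Hypothetical data: two clusters of three documents each.
full = {"A": [0, 1, 0], "B": [0, 0, 0]}
partials = {
    "A": [[1, 1, 0], [0, 1, 1]],  # flags for part B, then part C
    "B": [[1, 0, 1], [0, 0, 1]],
}
print(score_clusters(full, partials))              # {'A': 2, 'B': 1}
print(score_clusters(full, partials, min_hits=1))  # {'A': 1, 'B': 0}
```

Taking the minimum rewards clusters that are strong on every partial constraint; the text also allows other combinations of the counts.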

Having scored all the clusters, processor 40 branches
from step 102 to step 104. During step 104 processor 40 uses
the cluster scores previously generated to determine the
order of cluster presentation and then presents the cluster
summaries in that order. Processor 40 presents first the
cluster including the greatest number of documents satisfying
the boolean constraint, next the cluster including the
second greatest number of documents satisfying the boolean
constraint, and so on. The document summaries may be
presented via monitor 22, printer 24 and/or stored to solid
state memory 42 for later display. Cluster presentation
complete, processor 40 branches from step 104 to step 106,
returning control to the routine that called instructions 100.

E. Conclusion 

Thus, a method has been described for determining clus- 
ter ordering for presentation to a computer user in a way that 
emphasizes topic similarity. The method uses similarity 
between clusters to order their presentation in the absence of 
any information about topics of interest to the computer user. 

In the foregoing specification, the invention has been
described with reference to specific exemplary embodiments
thereof. It will, however, be evident that various
modifications and changes may be made thereto without departing
from the broader spirit and scope of the invention as set forth
in the appended claims. Accordingly, the specification and
drawings are to be regarded in an illustrative rather than a
restrictive sense.

What is claimed is: 

1. A method of ordering document clusters for presenta- 
tion after browsing a corpus of documents using a computer 
including a processor and a memory coupled to the 
processor, the processor implementing the method by 
executing instructions stored in the memory, the method 
comprising the steps of: 

a) grouping the corpus into a plurality of clusters, each 
cluster having a centroid including at least one docu- 
ment; 

b) determining for each cluster a degree of similarity
between the cluster and every other cluster by finding
a dot product between the cluster centroid and every
other cluster centroid;

c) determining an order of presentation of the plurality of
clusters based upon the degree of similarity between
clusters to maximize the degree of similarity between
adjacent clusters; and

d) presenting a number of the plurality of clusters to the 
computer user in the order of presentation. 
2. The method of claim 1 further comprising the step of:
e) generating a summary for each cluster. 

3. The method of claim 2 wherein each summary includes 
a number of typical titles of documents within the cluster. 

4. The method of claim 2 wherein each summary includes 
a number of frequently used terms within the cluster. 

5. The method of claim 1 further comprising the step of: 
e) generating a summary for each cluster. 

6. The method of claim 5 wherein each summary includes 
a number of typical titles of documents within the cluster. 

7. The method of claim 6 wherein each summary includes 
a number of frequently used terms within the cluster. 

8. A method of ordering document clusters for presenta- 
tion after browsing a corpus of documents using a computer 
including a processor and a memory coupled to the 
processor, the processor implementing the method by
executing instructions stored in the memory, the method 
comprising the steps of: 

a) grouping the corpus into a plurality of clusters, each 
cluster having a centroid and including at least one 
document, and each document including a rank gener- 
ated in response to a query of the computer user; 

b) generating a cluster summary for each of the plurality 
of clusters; 

c) if there has been no external input indicative of topics 
of interest to the computer user, determining for each
cluster a degree of similarity between the cluster and 
every other cluster by finding a dot product between the 
cluster centroid and every other cluster centroid; 

d) determining an order of presentation of the plurality of 
clusters based upon the degree of similarity between 
clusters to maximize the degree of similarity between 
adjacent clusters; and 

e) presenting the plurality of clusters to the computer user 
in the order of presentation. 

9. The method of claim 8 wherein each summary includes 
a number of typical titles of documents within the cluster. 

10. The method of claim 9 wherein each summary 
includes a number of frequently used terms within the 
cluster. 

11. An article of manufacture comprising: 

a) a memory; and 

b) instructions stored by the memory, the instructions 
representing a method of ordering document clusters in 
the absence of knowledge of user interests, the method 
comprising the steps of:

1) grouping the corpus into a plurality of clusters, each 
cluster having a centroid and including at least one 
document and each document including a rank 
generated in response to a query of the computer 
user;

2) if there has been no external input indicative of 
topics of interest to the computer user determining 
for each cluster a degree of similarity between the 
cluster and every other cluster by finding a dot 
product between the cluster centroid and every other 
cluster centroid; 

3) determining an order of presentation of the plurality
of clusters based upon the degree of similarity
between clusters; and

4) presenting the plurality of clusters to the computer 
user in the order of presentation. 






